Introduction
Neighborhoods with a high queer population, or Gayborhoods, have been urban areas in which LGBTQ+ people have found community and built culture worldwide. However, these geographic areas serve as much more than the sexuality of their constituents, and have been cited as yielding robust creative economies, as well as a welcoming environment for those of many identities. Knowing this, the identity of a certain locale as a Gayborhood becomes a crucial sociological metric, with neighborhoods with a more prevalent queer identity driving social liberalism in the face of prejudice. Our analysis thus focuses on building a location-based regression model that uses a variety of parameters including housing, land use, and non-queer demographic data to predict Gayborhood degree, and, in so doing, determine whether an area is suitable for queer folks with the goal of advancing understanding of liberal areas.
Outcome Variable
In order to rank Gayborhood status, we are using a published dataset from data world1 that was analyzed to display distribution of queer communities in Jan Diehm’s 2018 article Men are from Chelsea, Women are from Park Slope. Specifically, the metrics in this dataset are combined to form a Gayborhood index, which we will use as the supervising variable for our machine learning project. This index is a holistic assessment of queerness, derived primarily from the following measures:
- Same-sex joint tax filers
- Unmarried partner same sex households
- Number of Gay Bars
- Whether or not a pride parade routes through the region
In this case, we will be training a regression model to predict Gayborhood index. The initial variable exists on a 0 to 100 scale with a mean of 2.39, median of 1.27, and a max of 67.1. In the Exploratory Data Analysis section, we will discuss measures to transform the outcome variable into a more favorable distribution.
Data collection
Because this project is based primarily on data by ZCTA, a US Census-defined zipcode proxy, we are able to pool data from two main sources – the US Census and Open ICPSR– to encompass 4 major facets of urban life.
- Our first predictor set2 comes from the US Census, and provides a variety of information about housing characteristics including number of housing units in an area, rent prices, and indicators of development (phone reception, income to rent ratio, etc.). This is derived from the Census’s ACS 5-year survey estimates for the year 2015. We use data from 2015 since that is also when the Gayborhood dataset was published.
- We also used US Census data to glean information about various demographic characteristics, as represented in our second data section3. This set primarily synthesizes demographic information that is not specifically related to sexuality (i.e. the parameters that went into the calculation of our outcome variable), but describe other characteristics of each area.
- We also opted to include data from Open ICPSR, which included information about parks4 in each geographical region, including number of open parks and proportion of ZCTA area that is park space. Since parks are public areas, they serve as a predictive metric for the culture of a community, and are thus important in describing neighborhood lifestyle.
- Finally, we included data concerning the primary method of transport to work5, also from the US Census ACS survey. This will round out our predictors by describing the movement, as well as the physical locale, within a neighborhood.
These yields a functional dataset with > 550 predictors, which fall into the above categories. A skim of the first 10 variables is provided below:
| Name | Piped data |
| Number of rows | 2145 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| zip_code | 0 | 1 | FALSE | 2145 | 017: 1, 017: 1, 017: 1, 017: 1 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| gayborhood_index | 0 | 1 | 1.09 | 0.50 | 0 | 0.77 | 1.08 | 1.41 | 2.6 |
| estimate_sex_and_age_total_population | 0 | 1 | 29847.58 | 19524.92 | 304 | 15415.00 | 26827.00 | 39609.00 | 114982.0 |
| estimate_sex_and_age_total_population_male | 0 | 1 | 14550.75 | 9507.60 | 194 | 7460.00 | 13001.00 | 19268.00 | 60595.0 |
| estimate_sex_and_age_total_population_female | 0 | 1 | 15296.83 | 10078.78 | 110 | 7847.00 | 13772.00 | 20323.00 | 58520.0 |
| estimate_sex_and_age_under_5_years | 0 | 1 | 1924.05 | 1532.47 | 0 | 851.00 | 1584.00 | 2553.00 | 12098.0 |
| estimate_sex_and_age_5_to_9_years | 0 | 1 | 1865.10 | 1456.03 | 0 | 832.00 | 1585.00 | 2497.00 | 10215.0 |
| estimate_sex_and_age_10_to_14_years | 0 | 1 | 1841.39 | 1432.86 | 0 | 810.00 | 1544.00 | 2459.00 | 9785.0 |
| estimate_sex_and_age_15_to_19_years | 0 | 1 | 1880.63 | 1506.29 | 0 | 800.00 | 1527.00 | 2503.00 | 10857.0 |
| estimate_sex_and_age_20_to_24_years | 0 | 1 | 2105.32 | 1824.86 | 0 | 851.00 | 1667.00 | 2829.00 | 18938.0 |
| estimate_sex_and_age_25_to_34_years | 0 | 1 | 4717.06 | 3706.06 | 9 | 2024.00 | 3964.00 | 6408.00 | 26926.0 |
| estimate_sex_and_age_35_to_44_years | 0 | 1 | 4199.07 | 2902.13 | 0 | 2040.00 | 3730.00 | 5609.00 | 18739.0 |
Issues and Challenges
Collecting and cleaning the data was fairly easy. The census data is well organized, and generally has low missingness, which allowed us to only throw out a few potential predictors and will allow us to impute the rest. For cleaning of census data, we also removed columns that contained margins of error and annotations rather than the actual variables. The parks data required the removal of some text form within the data columns and conversion to numeric format. After cleaning individually, all datasets were then combined into a single dataset.
One main concern with our data is the ratio of observations to predictors, since the ratio is currently ~ 1:4. Because of this, most of our feature engineering will rely on targeted dimensionality reduction in order to explain the variance captured in these variables without overfitting to prediction variables.
Because census predictors were numerous, any variables with greater than 10% missing were removed. The remaining missing values are summarized in the table below, with higher missingness listed first. The missingness in these variables are assumed to be randomly missing since they were collected and reported in the US Census ACS survey:
Exploratory Data Analysis
Outcome Variable Transformation
The purpose of this regression problem is to accurately predict the outcome variable of interest, y. Based on the visualization below, a transformation should be applied to the outcome variable as it has a right skew, which will hinder the modeling process. Additionally, notice that the boxplot denotes many outliers within the data, which will lead to poor fits for certain models. This report hopes to apply a transformation such that the transformed y will appear to have data generated from a normal distribution.
We originally planned to apply a log10 transformation to our outcome variable, an index measuring how queer a ZCTA region was based on parameters defined in this dataset from data world6. However, the appearance of success from the log transformation was based off removal of 0 values, which are undefined (mapped to negative infinity in R). Examining the original data more closely, we find the following observations:
- Calculations for
totindex(the index measure) are not easily decipherable, and for a few ZCTAs, seem to be wrong entirely - There exist 0 values when some of the key queerness measures (e.g. number of same sex couples joint filing taxes) are positive
- Some zipcodes that are included in the data have no documented inhabitants (like 20535 and 94128)
In attempts to remedy these issues, we first dropped any zipcodes with \(<100\) inhabitants, and then recalculated our outcome variable, based on the following formula:
\[gayborhood\_index = 40*ss\_tax\_index + 40*ss\_live\_index + 10*parade + 10*bars\] where
- \(ss\_tax\_index\) is a measure of what percent of couples filing jointly are same sex couples (normalized by the maximum value, on a 0 to 1 scale)
- \(ss\_live\_index\) is a measure of what percent of unmarried couples living together are same sex couples (normalized by the maximum value, on a 0 to 1 scale)
- \(parade\) is an indicator variable indicating if a pride parade passes through that zipcode (0 or 1)
- \(bars\) is a measure of how many gay bars there are in that area (normalized by the maximum value, 20, on a 0 to 1 scale)
This calculation accurately repurposes the original dataset for our use, which is not as concerned with splitting up men and women as was the original article Men are from Chelsea, Women are from Park Slope.
We then used a Yeo-Johnson transformation with \(\lambda = -0.268\) (calculated by the applicable step_YeoJohnson()) to transform our data and improve normality. This transformation was chosen as it allows us to keep our 0 values, which have an important physical representation in this dataset, without greatly compromising normality:
Now, although the the transformed outcome variable still does not appear to have been generated from a normal distribution, there are no outliers, which will improve model performance.
Predictor Variable Selection
The provided data listed over 500 possible predictor variables. Fitting models to all potential predictor variables would not only be computationally expensive, but increase the variance of the predictions. Because of these issues, this report aims to mechanically select the most relevant predictors for the regression problem.
In addition to the issue of having many possible predictors variables, there are varying degrees of missingness across the predictors, which is resolved by imputing the variables with values from a K-Nearest Neighbors model.
Next, a LASSO model was fit to the training set to further narrow down the list of predictors. A lenient penalty of 0.01 was employed, along with recipe steps to remove variables with near zero variance, which would have provided little help in the prediction problem. Ultimately, about 100 predictors remained to be used in the model fitting process.
Feature Engineering
did we just not have missingness?? i think there was some or else what are we imputing… lol.
Data Splitting and Folding
For computational reasons, this project
Modeling Process
Overview
Model Types
we fit these model types:
Penalized Elastic Net Regression
- with and without partial least squares
Random Forest + Boosted Tree (XGBoost)
K-Nearest Neighbors
Neural Networks
- Basic neural networks using
nnetin R - Using
kerasand Tensorflow from python through R
- Basic neural networks using
Multivariate Adaptive Regression Splines
Cubist ensemble regression
Model Results
Our results are that… neural networks flopped, whatever the rnoein cubist ensemble regression is did well:
| Initial Model Comparison | |||
| Model Type | Best Mean RMSE | Standard Error | Optimal Hyperparameters |
|---|---|---|---|
| Cubist Ensemble Regression (BI) | 0.328 | 0.004 | Committees = 88, Max Rules = 290, Neighbors = 0 |
| Elastic Net (BI) | 0.337 | 0.004 | Penalty = 1.1e-08, Mixture = 0.051 |
| Random Forest (BI) | 0.339 | 0.003 | Mtry = 91, Trees = 1900, Min N = 3 |
| K-Nearest Neighbors (BI) | 0.343 | 0.003 | Neighbors = 22, Dist Power = 1 |
| Boosted Trees (BI) | 0.348 | 0.003 | Min N = 38, Learn Rate = 0.25, Loss Reduction = 1.2e-09 |
| Elastic Net (PLS) (BI) | 0.370 | 0.003 | Penalty = 0.0032, Mixture = 1 |
| Multivariate Adaptive Regression Splines (BI) | 0.376 | 0.003 | Num Terms = 5, Prod Degree = 1 |
| Neural Network (Keras) | 0.398 | 0.004 | Penalty = 1, Hidden Units = 5 |
| Null Model | 0.503 | 0.002 | NA |
WHATS THE SITCH WITH OTHER RECIPES?? DO WE WANT TO TRY THEM NOW?? probably not, we can say future research would include more recipe steps to test like splines (natural, b, whatever other step_spline stuff there is), PCA, etc.
Ensemble Model
we picked the three best ones
yay for stacks. we might actually want to try baguette instead, i feel like it slays, but i didnt do the reading so i owuldnt know.
final results
happy pride <3 such a well timed project tbh
Footnotes
Jan Diehm. (2018). Gayborhoods [Data set]. The Pudding, data.world. https://data.world/the-pudding/gayborhoods. Accessed 10 April 2023.↩︎
US Census. (2021). DP04: SELECTED HOUSING CHARACTERISTICS [Data set]. data.census.gov. https://data.census.gov/table?q=rent&g=010XX00US$8600000&tid=ACSDP5Y2021.DP04. Accessed 24 April 2023.↩︎
US Census. (2015). DP05: ACS DEMOGRAPHIC AND HOUSING ESTIMATES [Data set]. data.census.gov. https://data.census.gov/table?g=010XX00US$8600000&tid=ACSDP5Y2015.DP05. Accessed 26 April 2023.↩︎
Li, Mao, Melendez, Robert, Khan, Anam, Gomez-Lopez, Iris, Clarke, Philippa, and Chenoweth, Megan. (2020). National Neighborhood Data Archive (NaNDA): Parks by ZIP Code Tabulation Area, United States, 2018. Ann Arbor, MI: Inter-university Consortium for Political and Social Research. https://doi.org/10.3886/E119803V1. Accessed 23 April 2023↩︎
US Census. (2015). S0802: MEANS OF TRANSPORTATION TO WORK BY SELECTED CHARACTERISTICS [Data set]. data.census.gov. https://data.census.gov/table?q=commuting&g=010XX00US$8600000&tid=ACSST5Y2021.S0802. Accessed 30 April 2023.↩︎
Jan Diehm. (2018). Gayborhoods [Data set]. The Pudding, data.world. https://data.world/the-pudding/gayborhoods. Accessed 10 April 2023.↩︎